Concrete is the most used building material throughout the world today. Concrete is commonly composed of portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”). The purpose of this research is to determine which concrete components have a significant effect on the compressive strength. Using the Ordinary Least Squares method, the regression model was determined to be \(\hat{strength}\) = 89.5305187 + 0.0802058 \(C\) + 0.0675839 \(F\) - 0.1940091 \(W\) - 0.0367414 \(CA\) - 0.0152489 \(FA\). This indicates that concrete compressive strength increases as the portland cement (“C”) content increases, fly ash (“F”) content increases, water (“W”) content decreases, or aggregate (“CA” and “FA”) content decreases, accounting for the effects of the other variables in the model. This model accounts for 88% of the variability in concrete compressive strength and is more appropriate than using the mean compressive strength. Furthermore, upon validation, the model accounts for 90% of the response variability when predicting new concrete strengths and has a low error rate of 8.1%, indicating the model is appropriate for predicting the concrete compressive strength from the amounts of concrete components.
The data set was retrieved from UCI Machine Learning Repository on November 12, 2019. Data Reference: Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and artificial neural networks,” Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
In this day and age, we are surrounded by a concrete jungle: our buildings, our roads, and our pipelines are all made possible thanks to this wonderful material!
Concrete is a composite material composed of aggregates and cement or, simply put, “rocks glued together”. The beauty of composites is that they have unique properties that the individual components do not possess on their own. The aggregates reinforce the surrounding cement, creating a strong material. But what makes concrete so strong? Is there an optimum mixture of components?
The Concrete Jungle
Image Source: Edward Burtynsky, twistedsifter.com/2012/03/picture-of-the-day-the-concrete-jungle/
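The following components have an effect on the compressive strength (“strength”), flow (“Fl”), and slump (“Sl”) of the concrete material: portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”).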
The data represents the amount of each component (kilograms) in a cubic meter of concrete.
Image Source: Paulo Monteiro, UC Berkeley
Three concrete properties were investigated given various amounts of concrete components:
Concrete compressive strength (“Strength”)
Slump (“Sl”)
Flow (“Fl”)
Concrete Slump Test
Image Source: theconstructor.org/concrete/concrete-slump-test/1558/
The concrete data was separated into two data sets: 80% of the data was used to train the model (black) and 20% of the data was used to test the model (red).
A linear model relating portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”) to concrete compressive strength (“Strength”) was created.
Call:
lm(formula = Strength ~ C + S + F + W + SP + CA + FA, data = train_data)
Residuals:
    Min      1Q  Median      3Q     Max
-5.5113 -1.4818 -0.1948  1.2946  7.5805
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.25103   79.20910   1.556  0.12397
C             0.06870    0.02504   2.743  0.00763 **
S            -0.01870    0.03529  -0.530  0.59787
F             0.05611    0.02566   2.187  0.03190 *
W            -0.22442    0.08030  -2.795  0.00661 **
SP            0.10348    0.15451   0.670  0.50512
CA           -0.05005    0.03045  -1.644  0.10450
FA           -0.03017    0.03224  -0.936  0.35240
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.59 on 74 degrees of freedom
Multiple R-squared: 0.8901, Adjusted R-squared: 0.8797
F-statistic: 85.62 on 7 and 74 DF, p-value: < 2.2e-16
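The final model was determined by removing the variable with the highest p-value from the t-test until all remaining variables were significant at a significance level of 0.05. The following variables were determined to have a significant effect on the compressive strength (“Strength”) of concrete (from the t-tests): portland cement (“C”), fly ash (“F”), water (“W”), coarse aggregates (“CA”), and fine aggregates (“FA”).
The resulting linear model is: \(\hat{strength}\) = 89.5305187 + 0.0802058 \(C\) + 0.0675839 \(F\) - 0.1940091 \(W\) - 0.0367414 \(CA\) - 0.0152489 \(FA\).
Concrete strength increases as portland cement (“C”) content increases, fly ash (“F”) content increases, water (“W”) content decreases, coarse aggregate (“CA”) content decreases, or fine aggregate (“FA”) content decreases.
Using the training data, 87.95% of the variability in strength is accounted for using the model.
Using the test data to validate the model, about 90% of the variability in strength is accounted for by the model, and the error rate is about 8.1%.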
For the regression model to be appropriate, the following assumptions or conditions must be valid: linearity, zero mean, independence, equal variance, and normality.
Additionally, influential points and collinearity should be assessed.
Conclusion
From reviewing the conditions, the linear model is appropriate as the conditions are valid and there are no influential points or collinearity between variables.
Using the Residuals vs. Fitted Plot, there are no distinct patterns, which is an indication of a linear relationship between the concrete compressive strength (\(strength\)) and the regressors: portland cement (\(C\)), fly ash (\(F\)), water (\(W\)), coarse aggregates (\(CA\)), and fine aggregates (\(FA\)).
In Ordinary Least Squares (OLS) regression with an intercept, the fitted residuals sum to zero by construction, so the zero-mean condition is always satisfied.
Assuming the data is indexed in the order it was collected, the Indexed Residuals Plot shows no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, however, the independence condition cannot be confirmed.
Using the Scale-Location Plot, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.
Using the Q-Q Plot, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.
Using the Cook’s D Plot, there are no points with Cook’s D values greater than 1, suggesting there are no suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are “bad” values from mistakes in data collection or whether they represent actual values. If they are determined to be mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.
In addition to linearity between the regressors and the response, it is also important to assess near-linear dependence among the regressors.
C F W CA FA
1.287230 1.280103 1.375916 1.611531 1.277268
From the Variance Inflation Factors (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
Future Work
---
title: "MTH 542 Prj Concrete"
author: "K.M. Burzynski"
output:
flexdashboard::flex_dashboard:
theme: cosmo
orientation: columns
social: ["facebook", "twitter", "linkedin"]
source_code: embed
---
```{r setup, include=FALSE}
# load necessary packages
library(caret)
library(car)
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard) ## you need this package to create dashboard
# read the concrete data set (the path below is specific to the author's machine)
concretedata <- read.csv("/Users/katherineburzynski/Documents/MTH 543 - Linear Regression/Project/Concrete_test.csv")
```
Introduction
=======================================================================
Column {data-width=600}
-----------------------------------------------------------------------
### Determining the Effect of Concrete Components on Concrete Properties
#### **Determining the Effect of Concrete Components on Concrete Properties**
Concrete is the most used building material throughout the world today. Concrete is commonly composed of portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”). The purpose of this research is to determine which concrete components have a significant effect on the compressive strength. Using the Ordinary Least Squares method, the regression model was determined to be $\hat{strength}$ = 89.5305187 + 0.0802058 $C$ + 0.0675839 $F$ - 0.1940091 $W$ - 0.0367414 $CA$ - 0.0152489 $FA$. This indicates that concrete compressive strength increases as the portland cement (“C”) content increases, fly ash (“F”) content increases, water (“W”) content decreases, or aggregate ("CA" and "FA") content decreases, accounting for the effects of the other variables in the model. This model accounts for 88% of the variability in concrete compressive strength and is more appropriate than using the mean compressive strength. Furthermore, upon validation, the model accounts for 90% of the response variability when predicting new concrete strengths and has a low error rate of 8.1%, indicating the model is appropriate for predicting the concrete compressive strength from the amounts of concrete components.
The data set was retrieved from *UCI Machine Learning Repository* on November 12, 2019.
**Data Reference:**
Yeh, I-Cheng, "Modeling slump flow of concrete using second-order regressions and artificial neural networks," Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
### Is the compressive strength data normal?
```{r}
histstrength <- plot_ly(concretedata, x = ~Strength, type = "histogram")  # state the trace type explicitly
ggplotly(histstrength)
```
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### What makes concrete strong?
In this day and age, we are surrounded by a concrete jungle: our buildings, our roads, and our pipelines are all made possible thanks to this wonderful material!
Concrete is a composite material composed of aggregates and cement or, simply put, *"rocks glued together"*. The beauty of composites is that they have unique properties that the individual components do not possess on their own. The aggregates reinforce the surrounding cement, creating a strong material. But what makes concrete so strong? Is there an optimum mixture of components?

*Image Source: Edward Burtynsky, twistedsifter.com*
twistedsifter.com/2012/03/picture-of-the-day-the-concrete-jungle/
### Concrete Components
The following components have an effect on the compressive strength ("strength"), flow ("Fl"), and slump ("Sl") of the concrete material:
* Portland Cement (“C”)
* Blast furnace slag (“S”)
* Fly ash (“F”)
* Water (“W”)
* Superplasticizer (“SP”)
* Coarse aggregates (“CA”)
* Fine aggregates (“FA”)
The data represents the amount of each component (kilograms) in a cubic meter of concrete.

*Image Source: Paulo Monteiro, UC Berkeley*
### Concrete Properties
Three concrete properties were investigated given various amounts of concrete components:
**Concrete compressive strength ("Strength")**
* concrete samples were tested in compression until they failed
* determines the strength of the cured composite
* reported in megapascals (MPa)
**Slump ("Sl")**
* slump is how much drop there is in the wet concrete during the "Slump-Cone Test"
* helps understand how easy the wet concrete is to work with
* measured in centimeters (cm)
**Flow ("Fl")**
* flow is the diameter of the wet concrete cone during the "Slump-Cone Test"
* helps understand how easy the wet concrete is to work with
* measured in centimeters (cm)

*Image Source: theconstructor.org/concrete/concrete-slump-test/1558/*
Response Variable Exploration
=======================================================================
Column
-----------------------------------------------------------------------
### Compressive Strength of Concrete ("Strength")
* determines the strength of the cured composite
* reported in megapascals (MPa)
```{r}
boxstrength <- plot_ly(concretedata, x=~Strength, type="box")
ggplotly(boxstrength)
```
### Compressive Strength has a normal distribution
```{r}
histstrength <- plot_ly(concretedata, x = ~Strength, type = "histogram")  # state the trace type explicitly
ggplotly(histstrength)
```
Column
-----------------------------------------------------------------------
### Slump ("Sl")
* slump is how much drop there is in the cone
* measured in centimeters (cm)
```{r}
boxslump <- plot_ly(concretedata, x=~Sl, type="box")
ggplotly(boxslump)
```
### Slump has a left-skewed distribution
```{r}
histslump <- plot_ly(concretedata, x = ~Sl, type = "histogram")  # state the trace type explicitly
ggplotly(histslump)
```
Column
-----------------------------------------------------------------------
### Flow ("Fl")
* flow is the diameter of the cone
* measured in centimeters (cm)
```{r}
boxflow <- plot_ly(concretedata, x=~Fl, type="box")
ggplotly(boxflow)
```
### Flow has a bimodal distribution
```{r}
histflow <- plot_ly(concretedata, x = ~Fl, type = "histogram")  # state the trace type explicitly
ggplotly(histflow)
```
Linear Model of Strength
=======================================================================
Column
-----------------------------------------------------------------------
### Concrete Compressive Strength Data
```{r}
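set.seed(2019)  # fix the seed so the split is reproducible
# draw 82 of the 103 rows (about 80%) for training; the other 21 are held out
train_index <- sample(1:103, 82)
train_data <- concretedata[train_index, ]
test_data <- concretedata[-train_index, ]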
```
```{r}
boxplot(train_data$Strength, test_data$Strength,
        main = "Concrete Strength for Validation",
        names = c("train", "test"), horizontal = TRUE,
        xlab = "Concrete Strength (MPa)", col = c("black", "red"))
```
The concrete data was separated into two data sets: 80% of the data was used to train the model (black) and 20% of the data was used to test the model (red).
### Significant Variables Contributing to Concrete Compressive Strength
A linear model relating portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”) to concrete compressive strength ("Strength") was created.
```{r}
cstrength=lm(Strength~C+S+F+W+SP+CA+FA,train_data)
summary(cstrength)
```
Column
-----------------------------------------------------------------------
### Linear Model of Concrete Compressive Strength
The final model was determined by removing the variable with the highest p-value from the t-test until all remaining variables were significant at a significance level of 0.05. The following variables were determined to have a significant effect on the compressive strength ("Strength") of concrete (from the t-tests):
* portland cement (“C”)
* fly ash (“F”)
* water (“W”)
* coarse aggregates (“CA”)
* fine aggregates (“FA”)
```{r}
css=lm(Strength~C+F+W+CA+FA,train_data)
```
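The elimination procedure described above can be sketched as a loop. This is only an illustration under stated assumptions, not necessarily the exact steps used for this dashboard: `backward_eliminate` is a hypothetical helper (not from any package), and `sim` is simulated stand-in data, not the concrete data.

```r
# Backward elimination on t-test p-values (illustrative sketch).
# `backward_eliminate` is a hypothetical helper name.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the slope terms (the intercept row is dropped)
    pvals <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    if (max(pvals) <= alpha) return(fit)
    # drop the single least significant regressor and refit
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
  # nothing was significant: fall back to the intercept-only model
  lm(reformulate("1", response = response), data = data)
}

# Simulated stand-in data: only x1 truly drives y
set.seed(1)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
sim$y <- 5 * sim$x1 + rnorm(100)

fit <- backward_eliminate(sim, response = "y")
kept_pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
```

On the concrete data, a comparable call would presumably be `backward_eliminate(train_data[, c("C", "S", "F", "W", "SP", "CA", "FA", "Strength")], "Strength")`, assuming those column names.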
**The resulting linear model is:**
#### $\hat{strength}=$ `r css$coefficients[1]` + `r css$coefficients[2]`$C$ + `r css$coefficients[3]`$F$ `r css$coefficients[4]`$W$ `r css$coefficients[5]`$CA$ `r css$coefficients[6]`$FA$.
**Concrete strength increases as:**
* portland cement (“C”) content increases
* fly ash (“F”) content increases
* water (“W”) content decreases
* coarse aggregates (“CA”) decreases
* fine aggregates (“FA”) decreases
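As a worked example of reading the model, the sketch below plugs a mix into the fitted equation. The coefficients are the values reported in the introduction; the mix itself is hypothetical and chosen only for illustration.

```r
# Coefficients as reported in the introduction of this dashboard
coefs <- c(intercept = 89.5305187, C = 0.0802058, F = 0.0675839,
           W = -0.1940091, CA = -0.0367414, FA = -0.0152489)

# Hypothetical mix, in kg per cubic meter of concrete (illustration only)
mix <- c(C = 300, F = 100, W = 200, CA = 900, FA = 700)

# Predicted strength = intercept + sum of coefficient * amount
predicted_strength <- coefs[["intercept"]] + sum(coefs[names(mix)] * mix)
predicted_strength  # about 37.8 MPa

# Effect of adding 10 kg of water while holding the other components fixed
water_effect <- coefs[["W"]] * 10  # about -1.94 MPa
```

The second calculation shows how each coefficient is read: it is the expected change in strength per additional kilogram of that component, with the other components in the model held constant.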
#### **How good is this linear model at predicting the test data?**
Using the training data, 87.95% of the variability in strength is accounted for using the model.
```{r}
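prd <- predict(css, test_data)      # predicted strength for the held-out mixes
Rsq <- R2(prd, test_data$Strength)  # caret::R2 of predicted vs. observed strength
# RMSE relative to the mean observed strength, reported below as the error rate
Rootmean <- RMSE(prd, test_data$Strength) / mean(test_data$Strength)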
```
Using the test data to validate the model,
* `r Rsq*100` % of the variability in strength is accounted for by the model
* The error rate is `r Rootmean*100` %
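The two validation metrics can also be written out in base R. The sketch below assumes `caret`'s default behaviour, in which `R2()` returns the squared correlation between predicted and observed values and `RMSE()` the root mean squared error. The `obs` and `pred` vectors are hypothetical values, not results from the concrete data.

```r
# Hypothetical observed and predicted strengths in MPa (illustration only)
obs  <- c(30, 35, 40, 45)
pred <- c(32, 33, 42, 43)

r_squared  <- cor(pred, obs)^2            # squared correlation
rmse       <- sqrt(mean((pred - obs)^2))  # root mean squared error
error_rate <- rmse / mean(obs)            # RMSE relative to the mean response
```

Dividing the RMSE by the mean response expresses the typical prediction error as a fraction of a typical strength value, which is what the "error rate" above reports.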
Model Conditions
=======================================================================
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Summary
For the regression model to be appropriate, the following assumptions or conditions must be valid:
1. Linear
2. Zero Mean
3. Independent
4. Equal Variance
5. Normality
Additionally, influential points and collinearity should be assessed.
*Conclusion*
From reviewing the conditions, the linear model is appropriate as the conditions are valid and there are no influential points or collinearity between variables.
### 1. Linearity Condition
```{r}
plot(css,1)
```
Using the ***Residuals vs. Fitted Plot***, there are no distinct patterns, which is an indication of a linear relationship between the concrete compressive strength ($strength$) and the regressors: portland cement ($C$), fly ash ($F$), water ($W$), coarse aggregates ($CA$), and fine aggregates ($FA$).
### 2. Zero Mean Condition
In Ordinary Least Squares (OLS) regression with an intercept, the fitted residuals sum to zero by construction, so the zero-mean condition is always satisfied.
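A quick numerical check of this property is sketched below. It uses R's built-in `mtcars` data purely as a stand-in, since the point holds for any OLS fit that includes an intercept. For the model in this dashboard the analogous check would be `mean(residuals(css))`.

```r
# Stand-in model on built-in data (not the concrete data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
mean_resid <- mean(residuals(fit))
mean_resid  # zero up to floating-point error

# Without an intercept the residuals are no longer forced to average to zero
fit_no_int <- lm(mpg ~ 0 + wt + hp, data = mtcars)
mean_resid_no_int <- mean(residuals(fit_no_int))
```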
### 3. Independent Condition
```{r}
plot(css$residuals, main="Indexed Residuals",ylab="residual")
```
Assuming the data is indexed in the order it was collected, the ***Indexed Residuals Plot*** shows no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, however, the independence condition cannot be confirmed.
### 4. Equal Variance (aka Constant Variance) Condition
```{r}
plot(css,3)
```
Using the ***Scale-Location Plot***, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.
### 5. Normality Condition
```{r}
plot(css,2)
```
Using the ***Q-Q Plot***, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.
### Influential Points
```{r}
plot(css,4)
```
Using the *Cook's D Plot*, there are no points with Cook's D values greater than 1, suggesting there are no suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are "bad" values from mistakes in data collection or whether they represent actual values. If they are determined to be mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.
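The same threshold can be checked numerically rather than by eye. The sketch below uses the built-in `mtcars` data as a stand-in model. For this dashboard the model object would be `css`.

```r
# Stand-in model on built-in data (not the concrete data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooks_d <- cooks.distance(fit)          # one Cook's D value per observation
flagged <- names(cooks_d)[cooks_d > 1]  # observations above the threshold of 1
flagged                                 # empty when nothing exceeds the threshold
```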
### Collinearity
In addition to linearity between the regressors and the response, it is also important to assess near-linear dependence among the regressors.
```{r}
sqrt(vif(css))
```
From the *Variance Inflation Factors* (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
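To show what the VIF measures, the sketch below computes it by hand: each regressor is regressed on the others, and the VIF is $1/(1-R_j^2)$. It uses three columns of the built-in `mtcars` data as stand-in regressors. For a model with only numeric regressors, the result should agree with what `car::vif()` reports.

```r
# Stand-in regressors from built-in data (not the concrete data)
X <- mtcars[, c("wt", "hp", "disp")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

sqrt_vif  <- sqrt(vif_manual)
collinear <- names(sqrt_vif)[sqrt_vif > 2]  # regressors above the rule of thumb
```

The square root of the VIF indicates how much wider a coefficient's standard error is than it would be if that regressor were uncorrelated with the others.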
Future Work
=======================================================================
Future Work
* investigating transformations to model the effect of concrete components on concrete slump and flow for the cone test
* looking into other models that may be more appropriate to model the concrete compression data